SCR algorithm: saving/restoring states of file systems 

Source: ACM SIGOPS Operating Systems Review, Volume 33, Issue 1 (January 1999)
Pages: 26-33
Year of Publication: 1999
ISSN: 0163-5980
Authors: Wei Xiao-Hui, Ju Jiu-Bin
Publisher: ACM Press, New York, NY, USA




DOI: http://doi.acm.org/10.1145/309829.309839



SCR ALGORITHM: SAVING/RESTORING STATES OF FILE SYSTEMS



WEI Xiao-Hui JU Jiu-Bin 

Wei@,dcsjlu,edu,cn jjb@mail.jlu.edu.cn 
(Department of Computer Science, Jilin University, Changchun 130023, China) 

Abstract Fault tolerance is very important in cluster computing. Many well-known cluster computing systems
implement fault tolerance with a checkpoint/restart mechanism. However, existing checkpointing algorithms
cannot restore the state of a file system when they roll back a running program, so existing fault-tolerant
systems place many restrictions on file access. This paper presents the SCR algorithm, which is based on
atomic operations and consistent scheduling and can restore the state of file systems. In the SCR algorithm,
system calls on file systems are classified into idempotent operations and non-idempotent operations. A
non-idempotent operation modifies a file system's state; an idempotent operation does not. The SCR algorithm
dynamically tracks a running program and logs on disk each non-idempotent operation the program uses,
together with the information needed to undo it. When the checkpointing mechanism rolls the program back,
the SCR algorithm reverts the file system state to that of the last checkpoint. With the SCR algorithm,
users may use any file operation in their programs.
Key Words: fault-tolerance, checkpointing, atomic operation, recoverability of file systems.

1 Introduction 

Cluster computing systems, such as PVM [1], have become increasingly important parallel computing
environments because of their good performance/cost ratio and large computing capacity. However, the more
machines a computation involves, the higher the fault rate. Hence, improving the reliability of cluster
computing systems is necessary. To provide users with a reliable computing environment, many cluster
computing systems implement fault tolerance with checkpointing algorithms, such as Condor [2-3], Mist [4-5],
CoCheck [6], Fail-safe PVM [7], and Dome [8-9].

A computation is composed of data and a program. Checkpointing is usually used to improve the program's
reliability. Periodically, a checkpointing algorithm takes a global checkpoint to save the current running
state of the program in stable storage. When a failure occurs, the checkpointing algorithm rolls the program
back to the last checkpoint cut to mask the failure.

Data are usually saved in files. Two methods are commonly used to improve the availability of a distributed
file system [10]. One is to make file systems recoverable; the other is to make them robust. A file is
recoverable if it can be reverted to an earlier, consistent state when an operation on the file fails or is
aborted by the client. Recoverable files are implemented with atomic update techniques and are mainly used in
database systems. A file is robust if it is guaranteed to survive crashes of the storage device and decay of
the storage medium. Robust files are implemented with redundancy techniques such as mirrored files and
RAID [11-14].

In cluster computing systems, file systems must be recoverable to support fault tolerance. File systems are a
very important part of a program's environment. While running, a program needs to access file systems from
time to time to read input or write output, so an incorrect file system state leads to incorrect execution.
Existing fault-tolerant systems roll back only a program's running state and do not restore the file system
state correspondingly. If the program has executed operations that modify the file system state, such as
delete or rename, it reruns in a changed file system environment, and its behavior is then unpredictable.
For this reason, Condor, Mist, and other fault-tolerant systems do not allow their users to access file
systems arbitrarily; only read-only or write-only operations are permitted. Such restrictions prevent general
applications from using checkpointing to improve reliability. In [15], random writes and reads are considered
unrecoverable operations. [4] suggests that a write-and-copy technique might resolve the problem, but no
further work is reported.

This paper presents the SCR algorithm, which is based on atomic operations and consistent scheduling. The
SCR algorithm extends the concept of file system recoverability: it can revert the changes made to a file
system's state by a whole program run, not only by a single file operation. With the SCR algorithm,
synchronous checkpointing algorithms [16] can allow arbitrary file access in user applications.

The rest of the paper is organized as follows. Section 2 gives the important concepts and definitions.
Section 3 presents the SCR algorithm and its proof of correctness. Section 4 presents an example
implementation of the SCR algorithm. Section 5 discusses the SCR algorithm's performance, and the last
section concludes the paper.

2 Concepts and Definitions 

Two general strategies can be used to save and restore the state of a distributed file system in step with
the fault-tolerant execution of a distributed program: the static save/restore strategy (SSR) and the
dynamic save/restore strategy (DSR). In SSR, when the checkpointing mechanism checkpoints a program, a copy
of the current file system state is also saved in stable storage. When the checkpointing mechanism rolls the
program back after a failure, the file system state is replaced by the copy. DSR dynamically tracks a
running program and logs in stable storage every file operation that modifies the file system state,
together with the information needed to undo it. When the checkpointing mechanism rolls the program back,
DSR reverts the file system state to the last checkpoint by undoing all the logged file operations
performed by the program.

In general, how a program will change the file system state while it runs is unpredictable. Hence, SSR must
save the state of every file system that the program is permitted to access. Much unnecessary information is
saved, so SSR is very inefficient. Moreover, the high disk space overhead makes SSR difficult to implement.
In contrast, DSR follows what the program actually does, so only the useful information is saved. DSR is
therefore much more efficient and practical than SSR. The SCR algorithm is based on the DSR strategy.

In Unix, a file system is composed of files and directories. Applications access file systems through
system calls provided by the OS. In SCR, system calls related to file access are classified into idempotent
operations and non-idempotent operations. SCR defines a save operation and an undo operation for each
non-idempotent operation. A set of stacks on stable storage (such as disks), called undo stacks in SCR,
saves the information needed to undo the file system changes made by a running distributed program.

Definition 1 A file system's states 

A file system's states are the names, contents, locations, owners, modes, link counts, and so on of its
directories and files. These characteristics do not depend on any running program. SCR is responsible for
saving and restoring them. (Note: A file system's states do not include a file's open mode, open file
handle, read and write pointer offsets, and so on. Because these cannot exist independently of a running
program, they are part of the program's running state and should be saved and restored by the checkpointing
mechanism.)

Definition 2 Idempotent operations and Non-idempotent operations 

All system calls for file access are classified into idempotent operations and non-idempotent operations.
Non-idempotent operations are all the system calls that may modify a file system's states. Idempotent
operations are all the system calls on file systems that never change a file system's states. For example,
in SunOS 4.1.3, create(), mkdir(), remove(), rmdir(), rename(), write(), chown(), chmod(), link(), and
unlink() are non-idempotent operations; read(), access(), fstat(), and stat() are idempotent operations. In
a broad sense, all idempotent operations are some kind of read operation, and all non-idempotent operations
are some kind of write operation.

Definition 3 NOP 

Suppose nop_i (i = 1, ..., n) is a non-idempotent operation and op_j (j = 1, ..., m) is an idempotent
operation. NOP is the set of file operations such that nop_i ∈ NOP for all i = 1, ..., n, and op_j ∉ NOP
for all j = 1, ..., m.

Definition 4 Undo Stack

An undo stack, composed of a set of local undo stacks, resides on stable storage such as disks. The data
unit of an undo stack is called an undo structure in SCR. An undo structure logs a non-idempotent
operation's name and the information needed to undo that operation. In SCR, every machine holds a local
undo stack for a distributed program. When the program executes a non-idempotent operation on a machine's
file system, an undo structure is pushed onto that machine's local undo stack for the program at the same
time.

Definition 5 Save operations and Undo operations 

A save operation (sop_i) and an undo operation (uop_i) are defined for each nop_i. All sop_i and uop_i are
atomic operations. A sop_i creates and fills the undo structure for the corresponding nop_i, pushes it onto
the corresponding local undo stack, and then performs the file access of the nop_i. A uop_i undoes the nop_i
according to its undo structure and then pops the undo structure from the corresponding local undo stack.

3 SCR (Save, Clean, Restore) Algorithm 

A distributed program is composed of a set of concurrent processes running on different machines.
Consistent scheduling is used to coordinate the concurrent file operations of a program. Each machine runs
a manage process (MP) to manage its local file systems.

Algorithm Description

Every MP holds a local undo stack for a program, which is empty (NULL) at the beginning.

During normal running, when a process of a program is about to execute a nop_i on a machine's file system,
the process sends a request to that machine's MP and waits for the result. When a process of a program is
about to execute an op_j, the process performs the file access directly.

During normal running, each MP waits for nop_i requests in a loop. For each nop_i request, in FIFO order,
the MP executes the corresponding sop_i and returns the result. When executed, a sop_i pushes the undo
structure for the nop_i onto the local undo stack.

When a distributed program takes a global checkpoint, every MP cleans its local undo stack for the
program.

When a distributed program is rolled back, each MP executes uop_i operations in a loop until all the undo
structures have been popped from the local undo stack.

At the end of the program's run, the MPs clean up all undo stacks of the program.
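
The save, clean, and restore phases described above can be sketched in C for a single MP. This is a minimal in-memory sketch under stated assumptions, not the authors' implementation: the "file system" is a hypothetical array of integer cells, a nop is modeled as overwriting one cell, and the names mp_t, undo_t, mp_save, mp_clean, and mp_restore are introduced here for illustration only. In SCR the undo stack would reside on stable storage.

```c
#include <assert.h>
#include <stddef.h>

#define FS_CELLS   8    /* hypothetical: size of the toy "file system" */
#define STACK_MAX  64   /* hypothetical: capacity of the local undo stack */

/* One undo structure: which cell a nop changed and its previous value. */
typedef struct {
    size_t cell;
    int    old_value;
} undo_t;

/* State owned by one manage process (MP). */
typedef struct {
    int    fs[FS_CELLS];       /* local file system state */
    undo_t stack[STACK_MAX];   /* local undo stack (on disk in SCR) */
    size_t depth;              /* number of logged undo structures */
} mp_t;

/* Save: log the undo information first, then apply the nop. */
static void mp_save(mp_t *mp, size_t cell, int new_value)
{
    assert(cell < FS_CELLS && mp->depth < STACK_MAX);
    mp->stack[mp->depth].cell = cell;
    mp->stack[mp->depth].old_value = mp->fs[cell];
    mp->depth++;
    mp->fs[cell] = new_value;
}

/* Clean: at a global checkpoint the logged undo structures are discarded. */
static void mp_clean(mp_t *mp)
{
    mp->depth = 0;
}

/* Restore: undo the logged nops in reverse order until the stack is empty. */
static void mp_restore(mp_t *mp)
{
    while (mp->depth > 0) {
        mp->depth--;
        mp->fs[mp->stack[mp->depth].cell] = mp->stack[mp->depth].old_value;
    }
}
```

Undoing in last-in, first-out order matters when the same cell is changed more than once after a checkpoint: the oldest logged value is written back last, so the cell ends at its checkpoint value.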

Proof of Correctness

Suppose a distributed program is composed of q concurrent processes running on p machines (q ≥ p), and at
the i-th checkpoint the file system states are S_i, where S_i = {S^1_i, S^2_i, ..., S^p_i} and S^k_i
(k = 1, ..., p) is machine k's local file system state. S_0 is the file system state at the beginning of
the program's run. According to the SCR algorithm, at the i-th checkpoint all the local undo stacks of the
program are empty, and after the i-th checkpoint:

1. During normal running, when a process of the program executes a nop_i, the nop_i's undo structure is
pushed onto the local undo stack of the machine concerned. In other words, before a machine's file system
state is changed, the information needed to undo the change is first saved into the machine's local undo
stack.

2. When a failure occurs, suppose the global file system states are S_m = {S^1_{m_1}, ..., S^p_{m_p}}.
Without loss of generality, suppose that at this time machine k's (k = 1, ..., p) local undo stack is
{ud_info[1], ..., ud_info[m_k - 1], ud_info[m_k]}. From 1, we know that the program has executed m_k nops
(nop_{i,1}, ..., nop_{i,m_k}) on machine k since the last checkpoint (the i-th checkpoint). Because
idempotent operations never change a file system's state, and all file operations in SCR are atomic, the
state of machine k's file system must have changed as S^k_{i,0} -> S^k_{i,1} -> ... -> S^k_{i,m_k}
(note: S^k_{i,0} is S^k_i, and S^k_{i,m_k} is S^k_{m_k}). Each transition S^k_{i,j-1} -> S^k_{i,j}
(j = 1, ..., m_k) is caused by nop_{i,j}.

3. When the program is rolled back, machine k's MP executes m_k undo operations: uop_{i,m_k}, ...,
uop_{i,1}. Machine k's file system state then changes as S^k_{i,m_k} -> ... -> S^k_{i,1} -> S^k_{i,0},
which is S^k_i. So the global file system states are reverted to S_i, which are exactly the file system
states at the last checkpoint. At the same time, all undo stacks of the program are emptied.

4. When a new checkpoint is taken, the global file system states are S_{i+1}. All undo stacks of the
program are cleaned at this time, so when the next failure occurs, SCR reverts the global file system
states to S_{i+1}.

5. At the end of the program's run, no information needs to be saved, so SCR clears all of the program's
undo stacks.

In summary, the SCR algorithm reverts the file system states related to a program to those of the last
checkpoint when the program is rolled back. With the SCR algorithm, user applications can use any file
operation in a fault-tolerant system that uses a synchronous checkpointing algorithm.

4 An implementation example of SCR 

In this section, we present an implementation of the SCR algorithm built on DPVM [17,18]. DPVM is an
enhanced PVM on SunOS 4.1.3 that implements process migration and fault tolerance with a synchronous
checkpointing algorithm.
1. Define undo stack

All local undo stacks work on disks.

(1) Undo Structure

struct undo_info {
    int opflag;           /* opflag = 1: write() operation;
                                      2: delete() operation;
                                      3: rename() operation; */
    char init_fn[256];    /* file name string */
    char effect_fn[256];  /* file name string */
    char temp_fn[256];    /* file name string */
};



(2) PUSH operation 

Push an undo_info structure onto the top of the local undo stack.

(3) POP operation

Pop the top undo_info structure from the local undo stack. If the undo_info structure's temp_fn item is
not null, delete the file whose name is temp_fn.
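
The PUSH and POP operations above might look like the following in C. This is a sketch under stated assumptions rather than the DPVM code: the stack is a fixed-size in-memory array (the paper keeps it on disk), and struct undo_stack, UNDO_MAX, undo_push, and undo_pop are hypothetical names. The temp file is deleted with the standard C remove() function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define UNDO_MAX 128   /* hypothetical capacity; the paper's stack is on disk */

struct undo_info {
    int  opflag;           /* 1: write(), 2: delete(), 3: rename() */
    char init_fn[256];
    char effect_fn[256];
    char temp_fn[256];
};

struct undo_stack {
    struct undo_info items[UNDO_MAX];
    int top;               /* number of entries currently on the stack */
};

/* PUSH: copy an undo_info onto the top. Returns 0 on success, -1 if full. */
static int undo_push(struct undo_stack *s, const struct undo_info *u)
{
    assert(s != NULL && u != NULL);
    if (s->top >= UNDO_MAX)
        return -1;
    s->items[s->top++] = *u;
    return 0;
}

/* POP: remove the top entry and delete its temp file if one is named.
 * Returns 0 on success, -1 if the stack is empty. */
static int undo_pop(struct undo_stack *s, struct undo_info *out)
{
    assert(s != NULL && out != NULL);
    if (s->top == 0)
        return -1;
    *out = s->items[--s->top];
    if (out->temp_fn[0] != '\0')
        (void)remove(out->temp_fn);   /* a failed delete is ignored here */
    return 0;
}
```

Because POP deletes the temp file, an undo operation has to copy or rename the saved data back before it calls POP.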

2. The file-related system calls of SunOS 4.1.3 are divided into idempotent operations and non-idempotent
operations. In this section, we use only the write(), delete(), and rename() system calls as examples of
non-idempotent operations.

2.1 Define NOP

NOP = {write(), delete(), rename()}.

2.2 Define the save and undo operations for the write() system call

(1) save operation - sop1

A file is divided into pages of the same size, and every page has a flag that indicates whether the page
has been changed since the last checkpoint. Sop1 first determines whether all of the pages to be written
are already marked with changed_flag. (a) If some of these pages are not marked with changed_flag, sop1
creates a new undo_info structure and a new temp file. Sop1 saves those pages of the file into the temp
file and marks them with changed_flag at the same time. Then sop1 sets the undo_info structure's opflag to
1, init_fn to the name of the file to be written, and temp_fn to the name of the temp file. After that,
sop1 executes the PUSH operation and the write() operation. (b) If there are no such pages, sop1 simply
executes the write() operation.

(2) undo operation - uop1

Uop1 writes the pages saved in the temp file named temp_fn back to the file named init_fn, clears these
pages' changed_flag, and executes the POP operation.
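
The page-level saving done by sop1 and uop1 can be illustrated with an in-memory model. The sketch below makes several simplifying assumptions that are not in the paper: the file is a small byte array, the per-operation temp files are collapsed into one saved copy per page, and the names paged_file, sop1_write, uop1_restore, and checkpoint_clean are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4     /* hypothetical tiny page size, to keep the demo small */
#define NUM_PAGES 4
#define FILE_SIZE (PAGE_SIZE * NUM_PAGES)

typedef struct {
    char data[FILE_SIZE];               /* current file contents */
    char saved[NUM_PAGES][PAGE_SIZE];   /* stands in for the temp files */
    int  changed[NUM_PAGES];            /* changed_flag per page */
} paged_file;

/* sop1: save every not-yet-changed page the write touches, then write. */
static void sop1_write(paged_file *f, size_t off, const char *buf, size_t len)
{
    assert(len > 0 && off + len <= FILE_SIZE);
    for (size_t p = off / PAGE_SIZE; p <= (off + len - 1) / PAGE_SIZE; p++) {
        if (!f->changed[p]) {
            memcpy(f->saved[p], f->data + p * PAGE_SIZE, PAGE_SIZE);
            f->changed[p] = 1;
        }
    }
    memcpy(f->data + off, buf, len);
}

/* uop1 applied to all logged writes: copy saved pages back, clear flags. */
static void uop1_restore(paged_file *f)
{
    for (size_t p = 0; p < NUM_PAGES; p++) {
        if (f->changed[p]) {
            memcpy(f->data + p * PAGE_SIZE, f->saved[p], PAGE_SIZE);
            f->changed[p] = 0;
        }
    }
}

/* Checkpoint: the saved pages are no longer needed, so only clear flags. */
static void checkpoint_clean(paged_file *f)
{
    memset(f->changed, 0, sizeof f->changed);
}
```

The changed_flag is what keeps the log small: a page is copied at most once between two checkpoints, no matter how many times it is rewritten.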

2.3 Define the save and undo operations for the delete() system call

(1) save operation - sop2

Sop2 creates a new undo_info structure and generates a temp file name. Then sop2 sets the undo_info
structure's opflag to 2, init_fn to the name of the file to be deleted, and temp_fn to the newly generated
temp file name. After that, sop2 executes the rename operation "rename init_fn temp_fn" and then the PUSH
operation.

(2) undo operation - uop2

Uop2 executes the rename operation "rename temp_fn init_fn" and then the POP operation.
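
The delete-as-rename idea behind sop2 and uop2 can be tried with the standard C rename() and remove() functions. This is an illustrative sketch only: it omits the undo stack, and the function names sop2_delete, uop2_undelete, commit_delete, and file_exists are assumptions introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* sop2: "delete" a file by renaming it to a temp name, so that it can be
 * brought back later. Returns 0 on success, as rename() does. */
static int sop2_delete(const char *init_fn, const char *temp_fn)
{
    assert(init_fn != NULL && temp_fn != NULL);
    return rename(init_fn, temp_fn);
}

/* uop2: undo the delete by renaming the temp file back to its old name. */
static int uop2_undelete(const char *init_fn, const char *temp_fn)
{
    assert(init_fn != NULL && temp_fn != NULL);
    return rename(temp_fn, init_fn);
}

/* At checkpoint time the POP operation removes the temp file for good. */
static int commit_delete(const char *temp_fn)
{
    assert(temp_fn != NULL);
    return remove(temp_fn);
}

/* Demo helper: returns 1 if the file can be opened for reading, else 0. */
static int file_exists(const char *name)
{
    FILE *fp = fopen(name, "r");
    if (fp == NULL)
        return 0;
    fclose(fp);
    return 1;
}
```

Deferring the real deletion to the next checkpoint costs disk space for the hidden file, but it makes the undo a single rename with no data copying.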

2.4 Define the save and undo operations for the rename() system call

(1) save operation - sop3

Sop3 creates a new undo_info structure and sets the structure's opflag to 3, init_fn to the old name of
the file, and effect_fn to the new name of the file. Then sop3 executes the rename() operation and the
PUSH operation.

(2) undo operation - uop3

Uop3 executes the rename operation "rename effect_fn init_fn" and then the POP operation.

3. Add a wrapper to each system call (f_syscall()) on file systems:

if (f_syscall() ∈ NOP) then {
    send the f_syscall() request to the corresponding MP;
    wait for the result;
} else {
    call f_syscall();
}
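
The membership test in the wrapper above could be written as a small lookup. The sketch below is hypothetical: it identifies calls by name strings and only reports the route a call would take; sending the request to the MP is left out, and in_nop, route_call, and enum route are names introduced here.

```c
#include <assert.h>
#include <string.h>

/* The NOP set used in this section's example. */
static const char *const NOP[] = { "write", "delete", "rename" };

/* Returns 1 if the named call is in NOP and must go through the MP. */
static int in_nop(const char *call)
{
    assert(call != NULL);
    for (size_t i = 0; i < sizeof NOP / sizeof NOP[0]; i++)
        if (strcmp(call, NOP[i]) == 0)
            return 1;
    return 0;
}

enum route { ROUTE_DIRECT, ROUTE_VIA_MP };

/* Decide how the wrapper handles a call: forward it or run it locally. */
static enum route route_call(const char *call)
{
    return in_nop(call) ? ROUTE_VIA_MP : ROUTE_DIRECT;
}
```

Keeping NOP as data rather than code makes it easy to shrink the set for a particular application, as Section 5 suggests.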

4. Design the working flow of process MP

(1) During normal running

while (1) {
    wait for f_syscall() requests;
    execute the save operation for f_syscall();
    return the result to the process that sent the f_syscall() request;
}

(2) When a program is rolled back

while (the local undo stack is not NULL) {
    execute the undo operation according to the top undo_info structure of the local undo stack;
}
clear the changed_flag of every page of all changed local files;

(3) At a new checkpoint or at the end of a program's run

while (the local undo stack is not NULL) {
    execute the POP operation according to the top undo_info structure of the local undo stack;
}
clear the changed_flag of every page of all changed local files;

5 Discussions on performance 

The SCR algorithm's overhead has two sources. One is consistent scheduling; the other is the dynamic
logging and clearing of undo structures. Reducing the first depends on improving the scheduling policy,
which is our future work. The following discusses two strategies for reducing the second.

1. Reduce the information logged in the undo structure for each non-idempotent operation. For example,
suppose a file is 1M bytes and an application reads and writes the file while it runs. To reduce the
information logged in the undo structure for the write() system call, the file is divided into pages, as
in sop1 (Section 4). If the page size is 50K bytes and a write() system call writes 50K bytes into the
file, then the write()'s undo structure needs to log at most 100K bytes (2 pages), not 1M bytes.

2. Reduce the number of undo structures logged in the undo stacks. For example, if you have some
knowledge of an application's behavior, you can exclude some non-idempotent operations from the NOP set.
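
The page arithmetic in strategy 1 above can be checked with a short calculation. The helper names pages_touched and bytes_logged_max are hypothetical, and 50K is taken here as 51200 bytes. A write that starts on a page boundary touches one page; any other 50K write straddles two, which gives the 100K upper bound.

```c
#include <assert.h>
#include <stddef.h>

/* Number of pages overlapped by a write of len bytes starting at off. */
static size_t pages_touched(size_t off, size_t len, size_t page_size)
{
    assert(page_size > 0);
    if (len == 0)
        return 0;
    return (off + len - 1) / page_size - off / page_size + 1;
}

/* Upper bound on bytes logged for that write: whole pages are saved. */
static size_t bytes_logged_max(size_t off, size_t len, size_t page_size)
{
    return pages_touched(off, len, page_size) * page_size;
}
```

The bound is reached only when none of the touched pages already carries changed_flag; pages saved earlier in the same checkpoint interval are not logged again.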
Figure 1 Write-only operations

Figure 1 shows a program that generates prime numbers. The program writes 002, 003, ... into the output
file while it runs. Because the program executes only write operations on the file, we can exclude the
write() system call from NOP. Then nothing is logged in the program's undo stacks during its
fault-tolerant run. However, this does not affect fault tolerance. Referring to Figure 1, at the beginning
of the program's run, the output file is empty (Figure 1(1)). At the checkpoint, the offset of the output
file's write pointer is L1 (Figure 1(2)). When a failure occurs, the write pointer's offset is L2
(Figure 1(3)). When the program is rolled back, the write pointer's offset is set to L1, its value at the
last checkpoint. Because nothing is logged in the program's undo stacks, the output file's contents are
not reverted to the last checkpoint (Figure 1(4)). But because the program executes only write()
operations on the file, it re-executes from L1 and overwrites the stale data, so it still finishes
correctly (Figure 1(5)).
However, for some applications NOP must include the write() system call; otherwise the program's behavior
is unpredictable. Referring to Figure 2, a file's contents are a list of the char 'c' (Figure 2(1)), and
the program executes the following operations on the file: change every char 'c' to 'd', and change every
char 'd' to 'c'. At the checkpoint, the file write pointer's offset is L1 (Figure 2(2)). When a failure
occurs, the pointer's offset is L2 (Figure 2(3)). When the program is rolled back, the write pointer is
set to L1. If NOP does not include the write() system call, the file's contents are not reverted to the
last checkpoint (Figure 2(4)). Then, at the end of the program's run, the file's contents are those shown
in Figure 2(5). Clearly, this is not the expected result, which is a list of the char 'd'.

The two examples above show when NOP must include the write() system call and when it need not. In Condor
and similar systems, users are allowed to use read-only or write-only operations on files, but random
write and read operations are prohibited, for a similar reason.
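
The read-modify-write example above can be replayed in a few lines of C. The sketch rests on assumptions that go beyond the text: the file is an 8-char array, the program is modeled as a single pass that toggles each char, and a whole-file snapshot stands in for what the undo log would rebuild. The names toggle_range and run_with_failure are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define N 8   /* hypothetical file length in chars */

/* Toggle chars in [from, to): 'c' becomes 'd' and 'd' becomes 'c'. */
static void toggle_range(char *file, size_t from, size_t to)
{
    assert(from <= to && to <= N);
    for (size_t i = from; i < to; i++)
        file[i] = (file[i] == 'c') ? 'd' : 'c';
}

/* Run the program with a checkpoint at offset l1 and a failure at l2.
 * If log_writes is nonzero, the rollback also restores the file contents,
 * as SCR does when write() is in NOP; otherwise only the offset is reset. */
static void run_with_failure(char *file, size_t l1, size_t l2, int log_writes)
{
    char at_checkpoint[N];

    memset(file, 'c', N);
    toggle_range(file, 0, l1);          /* work done before the checkpoint */
    memcpy(at_checkpoint, file, N);     /* what the undo log could rebuild */
    toggle_range(file, l1, l2);         /* work lost when the failure hits */
    if (log_writes)
        memcpy(file, at_checkpoint, N); /* file reverted to the checkpoint */
    toggle_range(file, l1, N);          /* re-execution from offset l1 */
}
```

With l1 = 2 and l2 = 5, the run without logging ends with the chars between L1 and L2 toggled twice and therefore back at 'c', while the run with logging ends with every char equal to 'd'.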

6 Conclusion 

File systems are a very important part of a program's environment. While running, a program needs to
access file systems from time to time to read input and write output. Existing fault-tolerant systems roll
back only a program's running state and do not restore the file system state correspondingly. For this
reason, Condor, Mist, and other fault-tolerant systems do not allow their users to access file systems
arbitrarily; only read-only or write-only operations are permitted. The SCR algorithm, based on atomic
operations and consistent scheduling, resolves this problem. With the SCR algorithm, users may use any
file operation in their programs. At present, however, the SCR algorithm supports only synchronous
checkpointing algorithms; supporting asynchronous checkpointing algorithms is our future work.